[doc] Adjusted yuanrong backend doc #104
CLA Signature Pass: dpj135, thanks for your pull request. All authors of the commits have signed the CLA. 👍
5e20c9d to a80b636
Signed-off-by: dpj135 <958208521@qq.com>
a80b636 to e353874
Signed-off-by: dpj135 <958208521@qq.com>
41f4a62 to 0c31647
| ## Quick Start
|
| ### Prerequisites
| - **Python Version**: $ \geq 3.10~and \leq 3.11 $
This line is not rendered correctly by Markdown. Just use `>= 3.10, <= 3.11`.
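To make the suggested plain-text constraint concrete, here is a minimal sketch of a runtime check for the `>= 3.10, <= 3.11` range. The helper name `python_version_supported` is hypothetical and not part of TransferQueue. It assumes the bound is meant to cover every 3.10.x and 3.11.x release.

```python
import sys


def python_version_supported(version=None):
    """Return True if (major, minor) lies in the 3.10..3.11 range, inclusive.

    Hypothetical helper for illustration; not a TransferQueue API.
    """
    if version is None:
        version = sys.version_info
    major_minor = (version[0], version[1])
    return (3, 10) <= major_minor <= (3, 11)


print("current interpreter supported:", python_version_supported())
```

Comparing only `(major, minor)` keeps patch releases such as 3.11.9 inside the range, which a naive `<= (3, 11)` comparison on the full version tuple would reject.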
| - `--shared_memory_size_mb`: Shared memory size in MB for datasystem worker.
| - `--remote_h2d_device_ids`: Enable RH2D (Remote Host-to-Device) for efficient cross-node data transfer. Specify NPU device IDs as comma-separated values (e.g., `0,1,2,3`).
| - `--enable_huge_tlb`: Enable huge page memory, required for >21GB shared memory on Ascend 910B.
| - `--enable_huge_tlb`: Configure huge page memory to reduce TLB misses and improve memory access efficiency. Note: may cause system memory shortage, kernel OOM, or system instability. Required for >21GB shared memory on Ascend 910B.
I think it's better to remind users to allocate huge pages before starting datasystem. You could link to the datasystem huge page doc: https://pages.openeuler.openatom.cn/openyuanrong-datasystem/docs/zh-cn/latest/appendix/hugepage_guide.html
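As an illustration of the pre-flight reminder suggested here, the following sketch parses the standard Linux `/proc/meminfo` huge page counters so a user could confirm that pages were allocated before starting the worker. The function names are hypothetical, and the sample text is made up for the example.

```python
def parse_hugepages(meminfo_text):
    """Extract huge page counters from /proc/meminfo-style text."""
    wanted = {"HugePages_Total", "HugePages_Free", "Hugepagesize"}
    info = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key.strip() in wanted:
            # The first token is the number; Hugepagesize also carries a "kB" unit.
            info[key.strip()] = int(rest.split()[0])
    return info


def free_hugepage_mb(info):
    """Free huge page memory in MB (Hugepagesize is reported in kB)."""
    return info.get("HugePages_Free", 0) * info.get("Hugepagesize", 0) // 1024


# Made-up sample in the /proc/meminfo layout.
sample = "MemTotal: 16000000 kB\nHugePages_Total: 1024\nHugePages_Free: 512\nHugepagesize: 2048 kB\n"
print(free_hugepage_mb(parse_hugepages(sample)), "MB free in huge pages")
```

On a Linux host the same functions could be fed the contents of `/proc/meminfo`. How many pages to reserve for a given `--shared_memory_size_mb` is covered by the linked huge page guide, not by this sketch.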
|
| Next, we will provide deployment and code examples for single-node scenarios.
| For multi-node scenarios, please refer to [Appendix B](#B-deploy-multi-node-datasystem-for-multi-node-training-and-inference-scenarios).
| When `auto_init: True` is set in the configuration, TransferQueue automatically initializes the Yuanrong backend during `tq.init()`. The deployment process:
It's better to tell readers that Yuanrong is deployed per host: it manages all clients on the same node. Otherwise some users may mistakenly think the Yuanrong backend is per client.
| **NPU Transfer Options:**
| - `enable_yr_npu_transport`: Enable NPU transport for high-performance device-to-device data transfer. Set to `true` when using NPU tensors.
| - `worker_args` (recommended when `enable_yr_npu_transport: true`):
|   - `--remote_h2d_device_ids`: Enable RH2D (Remote Host-to-Device) for efficient cross-node NPU data transfer. Specify NPU device IDs as comma-separated values (e.g., `0,1,2,3`).
Yuanrong manages all the specified devices. If you want to set/get tensors on NPU x, you need to include device ID x in this argument.
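The rule in this comment can be sketched as a small membership check. Both function names are hypothetical and only illustrate the constraint; the sketch assumes the argument value is the comma-separated string shown in the doc (e.g., `0,1,2,3`).

```python
def parse_device_ids(value):
    """Parse a comma-separated device ID string such as "0,1,2,3"."""
    return [int(part) for part in value.split(",") if part.strip()]


def check_device_managed(device_id, remote_h2d_device_ids):
    """Raise if device_id is missing from the --remote_h2d_device_ids value."""
    managed = parse_device_ids(remote_h2d_device_ids)
    if device_id not in managed:
        raise ValueError(
            f"NPU {device_id} is not in --remote_h2d_device_ids={remote_h2d_device_ids}"
        )
    return managed


print(check_device_managed(2, "0,1,2,3"))
```

A check like this would fail early for a tensor on NPU 5 when only `0,1,2,3` are listed, instead of surfacing later as a transfer error.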
|
| ```bash
| # On head node
| ray start --head --resources='{"node:192.168.0.1": 1}'
I remember haichuan said that the node IP resource is not necessary. If that's true, this start command can be simplified.
This is for controlling the placement of Ray actors.
| TransferQueue will detect all Ray nodes and deploy datasystem workers automatically.
|
| Once the configuration is set, you can run your TransferQueue + Datasystem application directly.
| #### Multi-Node Demo
Add a short line telling users which lines must be modified (node IPs) before giving them a big chunk of code.
| If `worker_port` or `metastore_port` is already in use, initialization will fail:
|
| ```
| RuntimeError: Failed to start datasystem worker...
Is a port conflict the only possible reason for failing to start the datasystem worker?
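To help separate a port conflict from other startup failures, a user could probe the port before calling `tq.init()`. This is a minimal sketch using only the Python standard library; `port_is_free` is a hypothetical helper, and it covers just this one failure cause.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use"; other bind errors also land here.
            return False
        return True
```

For example, `port_is_free(31501)` probes the worker port used in the `dscli stop` example in this PR. If the port is free and the worker still fails to start, the cause lies elsewhere, which is the distinction this comment asks the FAQ to make.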
| # Clean up
| dscli stop --worker_address <IP>:31501
| # Or force cleanup
| pkill -f dscli
Kill `dscli` or kill `datasystem_worker`?
| pkill -f dscli
| ```
|
| ### Multi-Process Initialization
Users may be confused about how to init yuanrong-worker with multiple processes. This section explains the `tq.init()` process.
KaisennHu left a comment
Overall looks good. Some minor comments.
|
| **NPU Transfer Options:**
| - `enable_yr_npu_transport`: Enable NPU transport for high-performance device-to-device data transfer. Set to `true` when using NPU tensors.
| - `worker_args` (recommended when `enable_yr_npu_transport: true`):
When `enable_yr_npu_transport` is set to `true`, `remote_h2d_device_ids` is mandatory, not merely recommended.
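The mandatory relationship described here could be enforced with a check along these lines. This is a sketch under stated assumptions: the config is treated as a plain dict, `worker_args` is assumed to be a single string of CLI flags, and `validate_npu_transport` is a hypothetical name, not a TransferQueue function.

```python
def validate_npu_transport(config):
    """Check that RH2D device IDs are given whenever NPU transport is on.

    Assumes `config` is a plain dict and `worker_args` is one string of
    CLI flags; the real TransferQueue config layout may differ.
    """
    if not config.get("enable_yr_npu_transport", False):
        return True
    worker_args = config.get("worker_args") or ""
    if "--remote_h2d_device_ids" not in worker_args:
        raise ValueError(
            "enable_yr_npu_transport is true, so worker_args must set "
            "--remote_h2d_device_ids"
        )
    return True


print(validate_npu_transport({
    "enable_yr_npu_transport": True,
    "worker_args": "--remote_h2d_device_ids 0,1,2,3",
}))
```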
| 1. **Detects Ray cluster nodes** - identifies all alive nodes in the Ray cluster
| 2. **Creates placement group** - uses `STRICT_SPREAD` strategy to ensure workers are distributed across nodes
| 3. **Launches YuanrongWorkerActor** - creates one actor per node to manage the datasystem worker
| 4. **Sets up metastore service** - the head node (driver node) starts the metastore service, other nodes connect as workers
The '-' symbols look a bit strange.
I think it looks not bad. (^w^)
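Steps 1 and 2 of the quoted deployment list can be illustrated without a running cluster. The sketch below takes records shaped like the output of `ray.nodes()` (using its `Alive` and `NodeManagerAddress` keys) and derives a one-bundle-per-node plan. The function name and the `{"CPU": 1}` bundle size are assumptions for illustration, not TransferQueue's actual implementation.

```python
def plan_worker_placement(nodes):
    """Build a one-bundle-per-alive-node plan from ray.nodes()-style records."""
    alive_ips = [n["NodeManagerAddress"] for n in nodes if n.get("Alive")]
    return {
        # STRICT_SPREAD requires every bundle to land on a different node.
        "strategy": "STRICT_SPREAD",
        # One bundle per alive node; the CPU amount is an assumed placeholder.
        "bundles": [{"CPU": 1} for _ in alive_ips],
        "node_ips": alive_ips,
    }


nodes = [
    {"Alive": True, "NodeManagerAddress": "192.168.0.1"},
    {"Alive": True, "NodeManagerAddress": "192.168.0.2"},
]
print(plan_worker_placement(nodes))
```

With Ray installed, such a plan would map onto `ray.util.placement_group(bundles, strategy="STRICT_SPREAD")`, with one actor then scheduled into each bundle.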
| # On head node
| ray start --head --resources='{"node:192.168.0.1": 1}'
|
| # On worker node (assume ray port of head_node is 6379)
| ray start --address="192.168.0.1:6379" --resources='{"node:192.168.0.2": 1}'
To start Ray in an NPU environment, users need to be reminded to add `--resources='{"NPU": 4}'` or configure `ASCEND_RT_VISIBLE_DEVICES`.
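One way to keep the head and worker commands consistent once the NPU resource is added is to generate them. The sketch below builds `ray start` argument lists with the `--head`, `--address`, and `--resources` flags quoted in this PR. The helper name is hypothetical, and the NPU count of 4 simply mirrors the value in this comment.

```python
import json
import shlex


def ray_start_command(node_ip, head_address=None, npu_count=None):
    """Build a `ray start` argument list for the head or a worker node."""
    resources = {f"node:{node_ip}": 1}
    if npu_count is not None:
        resources["NPU"] = npu_count
    cmd = ["ray", "start"]
    # No head address means this node is the head; otherwise join the cluster.
    cmd.append("--head" if head_address is None else f"--address={head_address}")
    cmd.append(f"--resources={json.dumps(resources)}")
    return cmd


# Print shell-quoted commands for a head node and one worker node.
print(shlex.join(ray_start_command("192.168.0.1", npu_count=4)))
print(shlex.join(ray_start_command("192.168.0.2", "192.168.0.1:6379", npu_count=4)))
```

Whether the `node:<ip>` entry needs to be passed explicitly is the open question raised earlier in this thread; the sketch keeps it only to match the quoted commands.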
Pull request overview
Updates the Yuanrong storage-backend documentation to provide clearer installation, configuration, deployment, and troubleshooting guidance, and links the guide from the main README.
Changes:
- Added a README link to the Yuanrong usage guide.
- Restructured and expanded the Yuanrong backend guide with demos (single-node + multi-node), config explanations, and manual startup instructions.
- Added an FAQ section covering common deployment/runtime issues.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| README.md | Adds a direct link to the Yuanrong backend usage guide from the supported backends list. |
| docs/storage_backends/openyuanrong_datasystem.md | Expands and reorganizes Yuanrong backend documentation (install, demos, deployment, manual mode, FAQ). |
| # Install Torch (recommended version: 2.8.0 or higher)
| pip install torch==2.8.0
Add extra annotation
| # For root users
| ll /usr/local/Ascend/ascend-toolkit/latest
|
| # For non-root users
| ll ${HOME}/Ascend/ascend-toolkit/latest
| ```
|
| #### Option 1: Docker Image (Recommended)
|
| First, select the appropriate [CANN image](https://hub.docker.com/r/ascendai/cann) aligned with your **CANN version**, **Ascend hardware**, **OS**, and **Python version**. For examples:
TQ already declares openyuanrong-datasystem as an optional dependency. We can use `pip install TransferQueue[yuanrong]` to install the corresponding openyuanrong-datasystem directly.
Signed-off-by: dpj135 <958208521@qq.com>
bcd05d4 to 88b8591
Description
I've updated the description in the Yuanrong backend documentation, adding more usage guidance.
Main changes
`transfer_queue.init()` to start `TransferQueue` & `Yuanrong`. `auto_init=False`.